26 research outputs found

A realistic assessment of methods for extracting gene/protein interactions from free text

Author: A Moschitti
AB Clegg
Adrian J Shepherd
AM Cohen
Andrew B Clegg
AS Yeh
B Settles
C Nédellec
D Rebholz-Schuhmann
H Jose
HL Johnson
J Ding
J Fluck
JD Kim
JD Kim
K Franzén
K Fundel
K Sagae
L Hunter
M Krallinger
N Domedel-Puig
R Bunescu
R Hoffmann
R Kabiljo
R Kabiljo
R Leaman
R Sætre
Renata Kabiljo
S Pyysalo
S Pyysalo
S Pyysalo
T Hara
WA Baumgartner
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Background: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results: Our results show that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm, when coupled with a named entity tagger, outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion: In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

UCL Discovery

PubMed Central

Birkbeck Institutional Research Online

DNAscan2: a versatile, scalable, and user-friendly analysis pipeline for human next-generation sequencing data

Author: Al Khleifat A
Al-Chalabi A
Dobson RJ
Iacoangeli A
Kabiljo R
Marriott H
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/04/2023

SUMMARY: The current widespread adoption of next-generation sequencing (NGS) in all branches of basic research and clinical genetics means that users with highly variable informatics skills, computing facilities and application purposes need to process, analyse, and interpret NGS data. In this landscape, versatility, scalability, and user-friendliness are key characteristics of NGS analysis software. We developed DNAscan2, a highly flexible, end-to-end pipeline for the analysis of NGS data, which (i) can be used for the detection of multiple variant types, including SNVs, small indels, transposable elements, short tandem repeats, and other large structural variants; (ii) covers all standard steps of NGS analysis, from quality control of raw data and genome alignment to variant calling, annotation, and generation of reports for the interpretation and prioritization of results; (iii) is highly adaptable as it can be deployed and run via either a graphic user interface for non-bioinformaticians or a command line tool for personal computer usage; (iv) is scalable as it can be executed in parallel as a Snakemake workflow; and (v) is computationally efficient by minimizing RAM and CPU time requirements. AVAILABILITY AND IMPLEMENTATION: DNAscan2 is implemented in Python3 and is available at https://github.com/KHP-Informatics/DNAscanv2

UCL Discovery

GEOexplorer: a webserver for gene expression analysis and visualisation

Author: Al-Chalabi Ammar
Barnes Michael R
Dobson Richard JB
Grassi Luigi
Henkin Rafael
Hunt Guy P
Iacoangeli Alfredo
Ibrahim Zina
Kabiljo Renata
Koks Sulev
Smeraldi Fabrizio
Spargo Thomas P
Publication venue: OXFORD UNIV PRESS
Publication date: 24/05/2022

Gene Expression Omnibus (GEO) is a database repository hosting a substantial proportion of publicly available high throughput gene expression data. Gene expression analysis is a powerful tool to gain insight into the mechanisms and processes underlying the biological and phenotypic differences between sample groups. Despite the wide availability of gene expression datasets, their access, analysis, and integration are not trivial and require specific expertise and programming proficiency. We developed the GEOexplorer webserver to allow scientists to access, integrate and analyse gene expression datasets without requiring programming proficiency. Via its user-friendly graphic interface, users can easily apply GEOexplorer to perform interactive and reproducible gene expression analysis of microarray and RNA-seq datasets, while producing a wealth of interactive visualisations to facilitate data exploration and interpretation, and generating a range of publication ready figures. The webserver allows users to search and retrieve datasets from GEO as well as to upload user-generated data and combine and harmonise two datasets to perform joint analyses. GEOexplorer, available at https://geoexplorer.rosalind.kcl.ac.uk, provides a solution for performing interactive and reproducible analyses of microarray and RNA-seq gene expression data, empowering life scientists to perform exploratory data analysis and differential gene expression analysis on-the-fly without informatics proficiency

UCL Discovery

RetroSnake: A modular pipeline to detect human endogenous retroviruses in genome sequencing data

Author: Al Khleifat Ahmad
Al-Chalabi Ammar
Bouton Clement R
Bowles Harry
Dobson Richard JB
Iacoangeli Alfredo
Jones Ashley R
Kabiljo Renata
Marriott Heather
Quinn John P
Swanson Chad M
Publication venue: 'Elsevier BV'
Publication date: 04/10/2022

Human endogenous retroviruses (HERVs) integrated into the human genome as a result of ancient exogenous infections and currently comprise ∼8% of our genome. The members of the most recently acquired HERV family, HERV-Ks, still retain the potential to produce viral molecules and have been linked to a wide range of diseases including cancer and neurodegeneration. Although a range of tools for HERV detection in NGS data exist, most of them lack wet lab validation and do not cover all steps of the analysis. Here, we describe RetroSnake, an end-to-end, modular, computationally efficient, and customizable pipeline for the discovery of HERVs in short-read NGS data. RetroSnake is based on an extensively wet-lab validated protocol; it covers all steps of the analysis, from raw data to the generation of annotated results presented as an interactive html file, and it is easy to use by life scientists without substantial computational training. Availability and implementation: The pipeline and extensive documentation are available on GitHub.

University of Liverpool Repository

PubMed Central

UCL Discovery

King's Research Portal

Investigating heterogeneous protein annotations toward cross-corpora utilization

Author: A Arnold
A Yeh
AM Cohen
B Alex
B Efron
C Nédellec
CJ Kuo
EFTK Sang
EW Noreen
F Rinaldi
F Sha
G Zhou
H Daumé III
H Shatkay
HL Johnson
J Wilbur
JD Kim
JD Kim
Jin-Dong Kim
Jun'ichi Tsujii
K Franzén
K Yoshida
KB Cohen
L Gillick
L Tanabe
MA Mandel
R Bunescu
R Bunescu
R Kabiljo
RTH Tsai
Rune Sætre
S Pyysalo
Sampo Pyysalo
T Ohta
V Hatzivassiloglou
X Sun
Y Song
Y Wang
Yue Wang
Publication venue: BioMed Central
Publication date: 01/12/2009

Background: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. Results: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aspects aforementioned. Conclusion: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Profile of sleep disturbances in patients with recurrent depressive disorder or bipolar affective disorder in a tertiary sleep disorders service

Author: Benson J
Drakatos P
Higgins S
Kabiljo R
Kumari V
Liao Y
Nesbitt A
O’Regan D
Panayiotou C
Pool N
Romigi A
Rosenzweig I
Stokes PRA
Tahmasian M
Young AH
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 31/05/2023

Supplementary Information is available online at https://www.nature.com/articles/s41598-023-36083-7#Sec13. Copyright © The Author(s) 2023. A bidirectional relationship between sleep disturbances and affective disorders is increasingly recognised, but its underlying mechanisms are far from clear, and there is a scarcity of studies that report on sleep disturbances in recurrent depressive disorder (RDD) and bipolar affective disorder (BPAD). To address this, we conducted a retrospective study of polysomnographic and clinical records of patients presenting to a tertiary sleep disorders clinic with affective disorders. Sixty-three BPAD patients (32 female; mean age ± S.D.: 41.8 ± 12.4 years) and 126 age- and gender-matched RDD patients (62 female; 41.5 ± 12.8) were studied. Whilst no significant differences were observed in sleep macrostructure parameters between BPAD and RDD patients, major differences were observed in comorbid sleep and physical disorders, both of which were higher in BPAD patients. The two most prevalent sleep disorders, namely obstructive sleep apnoea (OSA) (BPAD 50.8% vs RDD 29.3%, P = 0.006) and insomnia (BPAD 34.9% vs RDD 15.0%, P = 0.005), were found to be strongly linked with BPAD. In summary, in our tertiary sleep clinic cohort, no overt differences in the sleep macrostructure between BPAD and RDD patients were demonstrated. However, OSA and insomnia, the two most prevalent sleep disorders, were found to be significantly more prevalent in patients with BPAD than in RDD patients. Also, BPAD patients presented with significantly more severe OSA, and with higher overall physical co-morbidity. Thus, our findings suggest an unmet/hidden need for earlier diagnosis of those with BPAD. Wellcome Trust [103952/Z/14/Z]

Brunel University Research Archive

A Comprehensive Benchmark of Kernel Methods to Extract Protein–Protein Interactions from Literature

The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein–protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. Nevertheless, our study shows that three kernels are clearly superior to the other methods.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Looking at Cerebellar Malformations through Text-Mined Interactomes of Mice and Humans

Author: A Jalali
A Rzhetsky
A Rzhetsky
A Rzhetsky
A Rzhetsky
A Rzhetsky
Andrey Rzhetsky
AO Wilkie
BW Soong
C Alfarano
C Friedman
C Friedman
C Kioussi
CJ Bult
CJ Bult
CL Smith
D Jiao
D Lang
D Tsavachidou
EJ Robson
Eran Segal
G Deda
GR Mishra
H Akil
H Lee
H Ueno
I Vastrik
Ilya Mayzus
Ivan Iossifov
J Alder
J Bandy
J Glienke
J Mestas
JD Schmahmann
JE Leestma
JF Rual
K Scearce-Levie
K Venkatesan
KA Waite
Kathleen J. Millen
KJ Millen
L Hunter
L Salwinski
L Sztriha
M Cokol
M Huang
M Krallinger
M Krallinger
M Krauthammer
M Krauthammer
M Krauthammer
M Nei
MA Basson
MA Parisi
ME Cusick
MS LeDoux
N Heintz
P Grimaldi
R Chowdhary
R Kabiljo
R Mizuguchi
R Rodriguez-Esteban
R Russo
Raul Rodriguez-Esteban
RV Sillitoe
S Mathivanan
S Papageorgiou
S Peri
SM Leach
TH Davenport
U Stelzl
VV Chizhikov
WM Fitch
X Li
Y Benjamini
Y Benjamini
Y Garten
Y Katoh
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2009

We have generated and made publicly available two very large networks of molecular interactions: 49,493 mouse-specific and 52,518 human-specific interactions. These networks were generated through automated analysis of 368,331 full-text research articles and 8,039,972 article abstracts from the PubMed database, using the GeneWays system. Our networks cover a wide spectrum of molecular interactions, such as bind, phosphorylate, glycosylate, and activate; 207 of these interaction types occur more than 1,000 times in our unfiltered, multi-species data set. Because mouse and human genes are linked through an orthological relationship, human and mouse networks are amenable to straightforward, joint computational analysis. Using our newly generated networks and known associations between mouse genes and cerebellar malformation phenotypes, we predicted a number of new associations between genes and five cerebellar phenotypes (small cerebellum, absent cerebellum, cerebellar degeneration, abnormal foliation, and abnormal vermis). Using a battery of statistical tests, we showed that genes that are associated with cerebellar phenotypes tend to form compact network clusters. Further, we observed that cerebellar malformation phenotypes tend to be associated with highly connected genes. This tendency was stronger for developmental phenotypes and weaker for cerebellar degeneration.

CiteSeerX

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Directory of Open Access Journals

PubMed Central

Biomedical Text Mining and Its Applications

Crossref

Directory of Open Access Journals

PubMed Central

Protein name tagging in the immunological domain

Author: Kabiljo R.
Shepherd Adrian J.
Publication date: 01/01/2008

The research described in this paper addresses the following question: How well do generic protein/gene name taggers perform when they are applied to full-text articles from the sub-domain of immunology (a subdomain with its own distinctive protein nomenclature)? To answer this question we have created a new corpus – ImmunoTome – consisting of ten full-text immunological articles in which the names of proteins have been manually annotated. Our results show that a single tagger – ABNER trained on the BioCreAtivE corpus – performs significantly better than the other taggers we evaluated when applied to ImmunoTome. ImmunoTome is available from immunominer.cryst.bbk.ac.uk/tome.html

Birkbeck Institutional Research Online